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Abstract 

We present algorithms to perform modular polynomial multiplication 
or modular dot product efficiently in a single machine word. We pack 
polynomials into integers and perform several modular operations with 
machine integer or floating point arithmetic. The modular polynomials 
are converted into integers using Kronecker substitution (evaluation at a 
sufficiently large integer). With some control on the sizes and degrees, 
arithmetic operations on the polynomials can be performed directly with 
machine integers or floating point numbers and the number of conversions 
can be reduced. We also present efficient ways to recover the modular 
values of the coefficients. This leads to practical gains of quite large 
constant factors for polynomial multiplication, prime field linear algebra 
and small extension field arithmetic. 

1 Introduction 

While theoretically well understood, the basic routines for linear algebra or 
polynomial arithmetic over finite fields are difficult to implement efficiently. Im- 
portant factors of speed can be gained by exploiting machine integer or floating- 
point arithmetic and ta king cache eff e cts in t o acco unt. This has been demon- 
strated for instance by iDumas et al. (2002; 2004) who wrapped cache-aware 



routines for efficient small finite field linear algebra in their FFLAS/FFPACK 
project. 

The elements of small enough prime fields can be represented as integers 
fitting in a machine word or half-word, or in exact floating point numbers. For 
extension fields, the elements can be represented as polynomials over a prime 
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field. A further compr ession is proposed by Dumas et al. I (120021 : [ 2008), using 



Kronecker substitution (jGathen and Gerhard! . 11993 . §8.4): the polynomials are 
represented by their value at an integer q larger than the characteristic of the 
field. We call DQT, for Discrete Q-adic Transform, this simple map: 



DQT : Z/pZ[X] Z 

k k 
i=0 i=0 



(1) 



With some care, in particular on the size of q, it is possible to map the operations 
in the extension field into the floating point arithmetic realization of this g-adic 
representation. 

In computational number theory, matrices over F2 are often compressed 
by fitti ng several ent r ies in t o one via the binary repre sentation of machine in- 
tegers ([Coppersmith . 1993: Kaltofe n and Lobol . Il999| ). The need for efficient 
matrix computations over very small finite fields also aris es in other areas a nd 



in particular graph theory (adjacency matrices), see e.g., (|Mav et all 120071 ) or 



(jWeng et al. . 2007 ). In these cases, one is thus led to work with polynomials as 
in (Q}, using a small q. 

Recovering th e results is obtained by an inverse DQT. This consists of a 
radix conversion ( Gathen and Gerhard . 19991 Algorithm 9.14), followed by a 
reduction of the coefficients modulo p. On modern processors, machine divi- 
sion or remaindering, that are used in radix conversion, are quite slow when 
compared to other arithmetic operation^. 

In this article, we use two techniques to avoid some of these remainderings: 
(i) delayed reduction; (ii) simultaneous reduction. Delayed reduction means 
that after having been transformed into their DQT form, polynomials may un- 
dergo several arithmetic operations before being converted back into coefficient 
form. This is possible as long as the coefficients remain smaller than q, which 
is prevented by a priori bounds. Simultaneous reduction is the operation of re- 
covering ai mod p from a,*, for i = 1, . . . , A;, in the context of the inverse DQT. 
We propose a new algorithm called REDQ performing these k modular reduc- 
tions by a single division, |~-^-] additions and multiplications and some table 
look-up. We also discuss the possibility of replacing the remaining division by 
floating point operations, taking into account the rounding modes. 

We recall in Section [2] the Kronecker substitution and delayed reduction 
algorithms. Then we present our new simultaneous reduction algorithm and give 
its complexity in Section[3]and we discuss how to replace the remaining machine 
division by floating point operations with different rounding modes in Section 2J 
Our new reduction algorithm has two parts. The first one is a compression, 
performed by arithmetic operations, which reduces the size of the polynomial 
entries. The second one is performed only when required and is a correction, 



On a Xeon 3.6GHz using doubles, addition, multiplication and axpy take roughly the 
same time, while division is 10 times slower and fmod (floating point remainder) again 2.5 
times as long as division 
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which gives to the residues their correct value modulo p when the compression 
has shifted the result. We also present a time- memory trade-off enabling some 
table look-ups computing this correction. Then we apply the DQT to different 
contexts: modular polynomial multiplication in Section linear algebra over 
small extension fields in Section [SJ compressed linear algebra over small prime 
fields in Section [7] This gives some constraints on the possible choices of q and 
k. In the three applications anyway, we show that these compression techniques 
represent a speed-up factor usually of the order of the number k of residues 
stored in the compressed format. 

Prelimina r y vers ions of this work have been presented bv lDumasI f|2008h and 



Dumas et al.l J2008). Here, we give an improved version of the simultaneous 
reduction where the number of operations has been divided by two. We also 
give a complete study of the behavior of the division of integers by floating 
point routines, depending on the rounding modes. Finally, we present more 
experimental results and faster implementations of the applications, namely 
a Karatsuba version of the polynomial multiplication and a right compressed 
matrix multiplication. 



2 Q-adic Representation of Polynomials 

2.1 Kronecker Substitution 

The principle of Kronecker substitution is very simple. It consists in evaluating 
polynomials at a given integer as in Eq. |T]). For instance, for k = 2, the 
substitution is performed by the following compression: 

doublefe init3( doubled r, const double u, const double v, const double w) { 
// _dQ is a floating point storage of Q 
r=u; r*=_dQ; r+=v; r*=_dQ; return r+=w; 

} 

The integer q can be chosen to be a power of 2 in a binary architecture. Then 
the Horner like evaluation of a polynomial at q is just a left shift. One can then 
compute this shift with exponent manipulations in floating point arithmetic 
and use native shift operators (e.g., the <5C operator in C) as soon as values are 
within the 32 (or 64 when available) bit range. 

The motivation for this substitution is its use in multiplication. 

Example 1. To multiply a = X + 1 by 6 = X + 2 in Z/3Z[X] one can use the 
substitution X = q := 100; compute 101 x 102 = 10302; use radix conversion to 
write 10302 = g 2 + 3g + 2; reduce the coefficients modulo 3 to get ax 6 = X 2 + 2. 

More generally, if p is prime, a = ^2^ = q chX 1 and b — 1 ^iX % are two 
polynomials in Z/pZ[X], then one can perform the polynomial multiplication ab 
via Kronecker substitution. The product of a — 5Zi=o atq 1 and b = ^2^ = q biq 1 
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is given by 

2fc-2 / j \ 

a6 =E E^hW- ( 2 ) 

j=0 \i=0 / 

Now if g is large enough, the coefficient of does not exceed q — 1. If moreover fc 
is not too large, the product fits in a machine number (floating point number or 
integer). Thus in that case, it is possible to evaluate a and b as machine numbers, 
compute the product of these evaluations, and convert back to polynomials by 
radix conversion. There just remains to perform reductions of the coefficients 
modulo p. 

We show in Section [5] that one can also tabulate the evaluations at q, and 
that one can access directly the required part of the machine words (using e.g., 
bit fields and unions in C) instead of performing a radix conversion. 



2.2 Delayed Reduction 

With current processors, machine division (as well as modular reduction) is still 
much slower in general than machine addition and machine multiplication. It 
is possible to replace the machine division by so me other impl ementation, such 
as floating point multip l icatio n by the i nverse dShoup . 120071 ) or Montgomery 



1985); see e.g., ( DumasT2004 ) and references therein 



reduction ([Montgomery 
for more details. 

It is also important to reduce the number of machine remainderings when 
performing modular computations. This is achieved by using Kronecker sub- 
stitution seldom and perform as many arithmetic operations as possible on the 
transformed values. 



2.3 Discrete Q-adic Transform 

The idea of the Q-adic transform is to combine Kronecker substitution with 
delayed reduction. 

We call DQT the evaluation of polynomials modulo p at a sufficiently large 
q. Since this is a ring morphism, products as well as additions can be performed 
on the transformed values. We call DQT inverse the radix conversion of a g-adic 
expansion followed by a modular reduction. For a given procedure *(oi, . . . , a m ) 
performing only ring operations, the idea is to compute the DQT of the a^'s, 
perform * on these DQT's and compute the inverse DQT at the end. For 
appropriate choices of the parameters, this procedure re c overs the exact value. 
As an example, we recall the dot product of Dumas et al. ( 20021 ) in Algorithm[TJ 



The following bounds generalize those of ( Gathen and Gerhard . 19991 §8.4): 
Theorem 2. \Dumas et all \200A ) Let (3 be the number of mantissa bits avail- 



able within the machine numbers. If 



q>nk(p-\) 2 and (2k - 1) \og 2 (q) < (3, 



(3) 
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Algorithm 1 Polynomial dot product by DQT 

Input: Two vectors of polynomials v\ and v% in Z/pZ[X] n of degree less than 
k; 

Input: a sufficiently large integer q. 
Output: R £ Z/pZ[X], with R = • v 2 . 

Polynomial to q-adic conversion 
l: Set vi and V2 to the floating point vectors of the evaluations at q of the 
elements of v\ and vi. {Using e.g., Horner's formula} 

One computation 

2: Compute r — v\ T ■ V2 

Building the solution 

3: f = X)i=c> 2 fiiQ 1 - {Using radix conversion} 
4: For each i, set fii = mod p 
5: Return R = J2ito 2 ViX 1 



then Algorithm^ is correct. 

iDumas et al. I (120021 Figures 5 & 6) show that this wrapping is already a 
pretty good way to obtain high speed linear algebra over small extension fields. 
They reach high peak performance, quite close to those obtained with prime 
fields, namely 420 Millions of finite field operations per second (Mop/s) on a 
Pentium III, 735 MHz, and more than 500 Mop/s on a 64-bit DEC alpha 500 
MHz. This is roughly 20% below the pure floating point performance and 15% 
below the prime field implementation. We show in Section [6] that the new 
algorithms presented in this article enable to reduce this overhead to less than 
4%. 



3 REDQ: Modular Reduction in the DQT Do- 
main 

The first improvement we propose to the DQT is to replace the costly modular 
reduction of the polynomial coefficients (e.g., in Step 4 of Algorithm [T]) by a 
single division by p followed by several shifts. In the next section, we replace 
this division by a multiplication by an inverse. 

3.1 Examples 

We first illustrate the basic idea on a simple example. 

Example 3. Let a = X 2 + 2X + 3 and b = 4X 2 + 5X + 6 be unreduced modulo 
5, so that we want to compute a x b = 4X 4 + 3X 3 + 3X 2 + 2X + 3. We are 
free to choose the integer q at which the evaluation takes place in such a way 
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that shifting by powers of q is cheap. For clarity we take here q = 10000. Thus, 
we start from f := a x b = 40013002800270018 for which we need to reduce 
five coefficients modulo 5. In the direct approach, the coefficients would be 
recovered by computing 

4 = 0x5 + 4, 13 = 2x5 + 3, 28 = 5x5 + 3, 27 = 5x5 + 2, 18 = 3x5 + 3. 

The trick is that we can recover all the quotients at once. Indeed, we compute 
s := \f/p\ = 08002600560054003 that contains all the quotients (in boldface). 
The remainders (the coefficients) can then be recovered as it, := |/V<Z l J ~pY s l ( f \ > 
for i = 0, . . . , 4 

Some more work is needed in order to make this idea correct in general. 

Example 4. We now consider the polynomial R = 1234X 3 + 5678X 2 + 9123X + 
4567, the prime p = 23 and use q = 10 6 . Note that this time, p does not divide q. 
The polynomial we want to recover is R mod p = 15X 3 + 20X 2 + 15X + 13. We 
start from f = 1234005678009123004567 and the division gives s := [f/p\ = 
53652420783005348024. Proceeding as before, we get 

uq = 15, Mi = 8, U2 — 18, U3 = 15. 

These coefficients are small, but they are not the correct ones except for U3. 
Indeed, if C = ap + Ui, then Cq = u,g + 1 mod p so that each coefficient needs 
to be corrected to take into account the values of the preceding ones. We thus 
consider this first computation as a compression stage and use a correction stage 
that produces the correct values. The correction is obtained by taking ^ = 113 
and ^ — Ui — qui + i mod p for i — 0, 1, 2 and returning the /Vs. 

3.2 Algorithm 

Theorem 5. Algorithm REDQ is correct. 

The following lemma is probably classical. We state and prove it here for 
completeness. 

Lemma 6. For r e N and a, b £ W , 



LiJ 




-1 = 




a 




lab! 


b 



PROOF. Let k = [r/ab\, so that kab < r < (k + l)ab. Then kb < r/a < 
(k + l)b and since kb is an integer it follows that kb < [r/a\ < (k + 1)6. 
Dividing by b yields LL r / a J/H = The other side of the identity follows by 
exchanging a and b. 

PROOF, [of Theorem [5] First we prove that < m < p. Let T t = \ f/q l \. 
By Lemma E] [s/q'\ = [T t /p\, so that m = T 4 - p[Ti/p\ which proves the 
inequalities. 
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Algorithm 2 REDQ 



Input: Two integers p and q satisfying Conditions ([3]); 
Input: f = J2i=o V>i1 % e Z - 

Output: p E Z, with p — Ylt=o Mi<7* where Mi — Mi mod p. 



REDQ COMPRESSION 



p 



for i = to d do 



U,; = 

end for 



REDQ CORRECTION {when p { q} 

M<* = ""ci 

for i = to d — 1 do 

Pi = Ui - qu i+1 mod p; 
end for 

Return p = £\ =0 Mi? 1 ; 



Next we compute the value of it, by the following sequence of identities 
modulo p: 

d d 

= /W -1 = X! mod p- 

When q = mod p, we thus have Mi = pi mod p and the proof is complete. 
Otherwise, in view of the previous identity, the correction step on Line 7 pro- 
duces 

d d 

Ui - qu i+ i = ^ Mj?^ 4 - 9 X! ^jl 111 = Mi m °d p. 
j-i j=»+l 

Definition 7. We call REDQfc a simultaneous reduction of k residues performed 
by Algorithm [2] (in other words k = d + 1 if d is the degree of the Kronecker 
substitution). 

3.3 Binary Case 

When q is a power of 2 and elements are represented using an integral type, 
division by q 1 and flooring are a single operation, a right shift, or direct bit 
fields extractions when available. Moreover, the remainders can be computed 
independently and thus the loop of REDQ_COMPRESSION can be performed 
by only half of the required k axpy (combined addition and multiplication, or 
fused-mac) as shown below: 

Proposition 8. Let q be a power of two. Then, a RED Q k - COMPR ESS ION 
requires machine word multiplications and additions. 
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r/q h 
r/q 2 h 



Figure 1: REDQ 3 . COMPRESSION with 2 axpy 



PROOF. We use the notations of Algorithm__ Let b q — log 2 (g) be the number 
of bits of q — 2 bq and let (3 be the number of bits in the mantissa of a machine 
word. If the REDQfc is correct, we have q k < or b q k < ft. The first 
value Mo = f — s x p requires the whole mantissa. Now both [^rj and [^rj 
can be stored on at most k x b q — i x b q bits. Moreover, by Theorem [3] < 
L^"J — P x L^"J < L^"J so that the result does not overflow these (fc — i)b q bits. 
This proves that the operations can be done independently on the different parts 
of the machine words. Now, the total number of bits required to perform the 
loop of REDQ.COMPRESSION is 

k + 1' 

2 

This, combined with the non-overlapping of the operations, proves the proposi- 
tion. 



^ . k(k + l) 
b q }_^k-i = b q < (3 



Here is an instance of a REDQ compression with 3 residues in CH — h, using 
the syntactic sugar of bit field extraction: 

inline void REDQ_C0MP (UINT32_three& res, 

const double r, //to be reduced 

const double p){ // modulo 

_ULL64_unions rll, til; // union of 64, 17/34 or 34/17 bits 

rll._64 = static_cast<UINT64>( r ); 

tll._64 = static_cast<UINT64>( r/p ); // One division 

res. high = static_cast<UINT32>(rll . _64-tll . _64*p) ; // One axpy 
rll._17_34.low = rll. _34_17. high; // Packing 

tll._17_34.low = til. _34_17. high; // Packing 

rll._64 -= tll._64*p; // Two axpy in one 

res. mid = static_cast<UINT32> (rll . _17_34 .high) ; 
res. low = rll . _17_34 . low; 

> 

In general also the algorithm is efficient because one can precompute l/p, 
1/q, 1/q 2 etc. The computation of each m and ^ can also be pipelined or 
vectorized since they are independent. As is, the benefit when compared to 
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direct remaindering by p is that the corrections occur on smaller integers. Thus 
the remaindering by p can be faster. Actually, another major acceleration can 
be added: the fact that the \ii are much smaller than the initial /2j makes it 
possible to tabulate the corrections as shown next. 



3.4 Matrix Version of the Correction 

In Example [U the final correction can be written as a matrix vector product: 



f 


17 











1 


17 











1 


17 











1 



u mod p. 



More generally, the corrections of lines 5 to 8 of Algorithm [2] are given by a 
matrix- vector multiplication with an invertible matrix Qk'- 

1 -q ... 

'•• '•■ '•• : 



: —q 
1 



3.5 Tabulations of the Matrix- Vector Product and Time- 
Memory Trade-off 

Tabulating fully the multiplication by Qk requires a table of size at least p k+1 . 
However, the matrices Qk can be decomposed recursively into smaller similar 
matrices as follows: 



Q l 









1 









Qj 



It is therefore easy to tabulate the product by Qk with an adjustable table 
of size pP . 

Proposition 9. Let q be a power of 2. Given a table of size p? (1 < j < k + 1), 
a REDQkCORRECTION can be performed with \_(k— — 1)J table accesses. 

When q is a power of 2, the computation of the Ui in the first part of 
Algorithm [2] requires 1 div and (k + l)/2 axpy as shown in Proposition [8l Now, 
the time memory trade-off enables to compute the second part efficiently 

Example 10. We compute the corrections for a degree 6 polynomial. One can 
tabulate the multiplication by Q 6 , a 7 x 7 matrix, or actually, by only the first 



9 



6 rows of Qq, denoted by Q , with therefore p 7 entries, each of size 61og 2 (p). 
Instead, one can tabulate the multiplication by Q 2 , a 2 x 3 matrix. To compute 

[/xo, ■ ■ ■ , H&] T = Qe[uo ■ . ■ , uq] t = [Q q [uq . . . , ue] T , ue] it is sufficient to use three 
multiplications by Q 2 as shown in the following algorithm: 

Algorithm 3 Q§ with an extra memory of size p 3 
Input: [u Q . . . , Uq\ S (Z/pZ) 7 ; 

Input: a table Q 2 of the associated 2x3 matrix-vector multiplication over Z/pZ. 

Output: [jLto, ■ ■ ■ ,M6] T = <?6[uo ■ • ■ ,u e ] T . 
l: a , oi = Q 2 [uo, ui, u 2 ] T ; 

2: b ,&l = Q 2 [U2,U3,"4] T ; 
3: C ,Ci = Q 2 [U4,W5,"6] T ; 

4: Return \p , . . . ,/i 6 ] T = [a , ai, 6 , &i, c , ci, m 6 ] T ; 



Note on Indexing. In practice, indexing by a tuple of integers mod p is made 
by evaluating at p, as ^2 Uip 1 . If more memory is available, one can also directly 
index in the binary format using 

J2ui (2^(p^y. On the one hand all the 
multiplications by p are replaced by bit fields extraction. On the other hand, 
this makes the table grow a little bit, from p k to 2 r io S2 fc . 

In the rest of this article we restrict to the case when q is a power of two. 



4 Euclidean Division by Floating Point Routines 



In the implementation of REDQ above, there remains one machine division 
(Line 1). Since exact division is rather time-consuming on modern processors, 
it is natural to try and replace it by floating point operations. If r and p are 
integers and we want to compute r/p, then computing r/p by a floating point 
division with a rounding to n earest mode followed by flooring produces the 



expected value (jLefevret l2005al Theorem 1). 



Instead of a division, a multipli cation by a p recomputed inverse of p can 
be used (this is done e.g., in NTL (jShoud . 120051 . Theorem 1.5)). Ho wever, in 
that ca se, the correctness is not guaranteed for all representable r (see Lefevrei 
(|2005bh for details). Thus, as is done e.g. in NTL, one has to make some 
additional tests and corrections. In this section we propose bounds on r for 
which this correctness is guaranteed, taking the rounding modes into account. 
Outside of these bounds, we show that a difference of (at most) 1 after the 
flooring is possible for some values of r, p and the rounding modes. This can be 
detected by t wo tests on the resulting residue (below or above p) as is done by 
IShoupl ( 2007 ). We show that only one of these tests is mandatory if the rounding 



modes can be mixed. The inverse of the prime can be precomputed for each 
mode, which avoids the need for costly changes of modes. This is summarized 
in Algorithm [5] at the end of this section. 
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4.1 Rounding Modes 



We assume that the floating-point arithmetic follows the IEEE 754 standard and 
we denote by ulp(z) the unit in the last place of z for a floating-point number 
z such that 

2 /3 ~ 1 < \z\ <2 /3 -l. 

For each operation three rounding modes are possible (up A(-), down y(-) and 
neares ©(•)). If two computations take place at different times, each of them 
can be performed in a different rounding mode, so that we have 9 cases to 
consider. It is worth investigating all cases since changing rounding modes is 
a costly operation and the FPU may be in a particular rounding mode due to 
constraints in the surrounding code, or some rounding modes are simply not 
available in the particular computing environment. Also, computing the best 
bounds enables to make the best of delayed reduction in modular computations. 

4.2 Floating Point Division 

Denoting by r — kp + u the Euclidean division of r by p where < r < 2@ — 1 we 
are interested to know under which conditions on r and p Algorithm [¥] returns 
the quotient fc, depending on the rounding modes oi and o 2 . 



Algorithm 4 FDIV 

Input: One integer r such that < r < 2° — 1; 
Input: one integer p such that 1 < p < 2@ — 1; 
Input: two choices of rounding-modes oi and o 2 . 
Output: L£J- 

1: invp <— o 1 (l/p) 

2: x <— o 2 (r • invp) 

3: Return [a;J • 



4.3 Results 

Our results are summarized in Table [1] 

The column "Range" gives guaranteed bounds on the result. This shows 
which tests and corrections may be needed in order to compute the expected 
value k. The interval given in this column is optimal, in the sense that we have 
examples where \_x\ ^ k in each possible direction. 

The column "Bound on r" gives a strict upper bound on r under which 
[x\ < k, in those cases where the result could indeed overflow. We do not know 
whether these bounds are optimal; for some cases we could find a systematic 
family of examples indexed by (5 that reach the bounds asymptotically; other 
bounds could be approached by exhaustive search on small (3. These examples 
are not included here. We believe that all bounds are optimal except for case 2 

2 How ties are broken is irrelevant here. 
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where a value closer to | (instead of |) could be reached (take (3 = 2n + 1 and 
p = 2 n — 1, r = (3 • 2"~ 2 + 3)p — 1). For our purpose ^ is close enough. 

The column "Lost bits" gives a simplified version of this bound: if r fits in 
(3 minus this number of bits, then \ x\ < k. 



Case 


°i 


°2 


Range 


Bound on r 


Lost bits 


1 


A(-) 


A(0 


k < [x\ < k + 1 


2*7(4 + 2 2 -^) 


3 


2 


A(-) 


o(0 


k < [x\ < k + 1 


2^/(3 + 2^) 


2 


3 


A(-) 


v(0 


k < [x\ < k + 1 


2 p /2 


1 


4 


o(0 


A(0 


k < [x\ < k + 1 


2^/(3 + 2 1 -' 3 ) 


2 


5 


o(0 


o(0 


k - 1 < [x\ < k + 1 


2"-V(l + 2" 1 -^) 


2 


6 


o(0 


v(0 


fc - 1 < N < fc 







7 


v(0 


A(0 


k - 1 < [x\ < k + I 


2^/2 


1 


8 


v(0 


o(0 


k - 1 < [x\ < k 







9 


v(0 


v(0 


k-l<[x\ <k 








Table 1: Possible values of [x\ and bounds on r such that |_^J < k 



4.4 Proof of the Bounds on [^J 

We denote by ei and e 2 the errors in rounding as follows: 

invp = -(1 + ei), x = (r ■ invp)(\ + £2). 
P 

Thus the value that is computed is 

x= -(l + ei)(l + e 2 ) 
P 

U T 

= fcH h (ei + e 2 + eie 2 )- =: k + R (4) 

P P 

where R is the term we want to bound. For example R < 1 means [a^J < k. 
Bounds on ei and e 2 depend on the rounding mode, but in all cases we have 

N <2^, t€{l,2}. (5) 

Lemma 1. The result of Algorithm |4] is never off by more than one unit, that 
is 

k - 1 < lx\ < k + 1. 
Proof. First we show R < 2, which gives the upper bound. This is obtained by 
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injecting the inequalities ([5]), in the definition (0| of R: 



R<^- + (2 2 - f3 + 2 2 ~^)- (6) 
P P 

1 90-1 
< l-i + (2 2 -^ + 2 2 - 2/3 ^- 



P P 

= 1 - -(3- 2 2 - 2 ^). 
P 

Thus R < 2 for p > 3. In the special case p = 2, we have t\ — so that the 
result still holds. 

Similarly, in the other direction we have 

R>-(2 2 - f3 + 2 2 -^)- 
P 

> -1(4 -2 2 - 2f} ) 
P 

and R > — 1 for p > 4. For p = 2, the results follows from ex = 0. For the last 
case, p = 3, we first analyze more precisely the rounding error ex- The binary 
expansion of ^ = 0.01010101010 . . . implies that 

fi(l + 2-^- 1 ) if (3 is even and o (■) G {A(-), <>(■)}, 
°(g) = | |(1 + Z-^ 1 ) if /3 is odd and o (•) e {<>(•), V (0}, 
[i(l + 2~' 3 ) otherwise. 

Thus in all cases |ei| < 2~@. Using this better bound in the computation above 
completes the proof. □ 

Lemma 2. In cases 1, 2, 3 and 4, [a;J > k. 

Proof. In the first three cases, invp is rounded up, and thus 

r ■ invp = k{p ■ invp) + u ■ invp > k. 

This implies that o(r ■ invp) > k because rounding modes are monotone and k 
is exactly representable. 

In case 4, since |ei| < 2 _/3 , we have 

invp > (1 - 2 _/3 )-. 

P 

Then 

r ■ invp = (kp + u)invp > kp ■ invp > fc(l — 2 - ^). 

Again, fc is an integer thus exactly representable. Denote by k~ the largest 
floating-point number that is strictly less than k: 

^_ j k — iulp(fc) if k is a power of 2, 
1 k — ulp(fe) otherwise. 
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If fc is a power of 2, then the way of rounding fc(l — 2" 13 ) is the same as that of 
1 — 2~^ since fc only changes the exponent in the result. Since A(l — 2~P + d) > 1 
for any S > we get x = A(r • invp) > fc. If fc is not a power of 2, then 

fc~ = fc — ulp(fc) < fc(l — 2~' 3 ) < r ■ invp 

and thus again x = A(r ■ invp) > fc. □ 

Lemma 3. In cases 6, 8 and 9, [^J < fc- 

Proof. In case 6, we have |ei| < 2 _/3 . This implies 

r ■ invp = -(1 + ei) < -(1 + 2~ p ) <- + -<fc + l 
p p p p 

and therefore x = v( r ' invp) < fc + 1, whence [^J < fc- 
In cases 8 and 9, since invp — v(p)> we have 

r 1 

r • mwp < — < fc + 1 . (7) 

p p 

In case 9, this is sufficient to conclude. Otherwise, as in case 4 above, denote 
by (fc + 1)~ the floating-point number that is just below fc + 1 and let m be 
the midpoint between fc + 1 and (fc + 1)~. If we show r ■ invp < m, then 
o(r • invp) < (fc + 1)~ and therefore [^J < fc- 

First assume that fc + 1 is a power of 2. In this case ulp(fc + 1) = 2 1_,3 (fc + 1) 
and m = k+ 1 — |ulp(fc + 1) = (k + 1)(1 - 2 _/3_1 ). Since by assumption p < 2 13 
and r < 2^, so p(fc + 1) < r +p < 2 I3+1 and it follows that 

- > {k + l)2-^- 1 . 
P 

Together with (JTJ) this gives 

r ■ invp < (fc + 1)(1 - 2 _/3_1 ) = m. 

Assume now that fc + 1 is not a power of 2 (and fc ^ 0). Then ulp(fc + 1) = 
ulp(fc) < 2 1 ~' 3 fc, m = fc + 1 - iulp(fc) and ± > fc2-^ > ulp(fc)/2. Thus finally, 

r ■ invp < fc + 1 — — ulp(fc) < m. 

If fc = then iulp(fc + 1) = 2~ p < ± and the result follows. □ 

4.5 Bounds on r making the result exact 

In cases 6, 8 and 9, the result of Algorithm 2] is smaller than fc for all values 
of r. We now proceed to a case-by-case proof of each remaining row of Table [1] 
giving bounds on r such that this happens. 



14 



Case 1. In this case < q < 2 1 ^ for i 6 {1, 2} and thus ([5]) becomes 

P P 

so that \R\ < 1 is implied by 

1 



r < 



2 



4 + 22-/3 



that is close to 2 /3 /4 and we lose less than 3 bits compared to the bound r < 2 /3 
required to have no loss of precision on r. 

Case 2. In this case | ea | < 2~@ and ((6]) becomes 

R<P^± + (3-2-' 3 + 2 1 - 2 P)- 
P P 

The condition R < 1 is implied by 

1 

r 



3 . 2 -/3 + 2 l-2/3 < 

that is r < 2^/(3 + 2 1 ~ /3 ) which is close to 2^/3, and less that two bits are lost. 



Case 3. In this case -2^^ < e 2 < 0, \t x + e 2 | < 2 1 "/ 3 and ei£ 2 < 0. We get 

i? < 1 - - + -2 1 -' 3 
P P 

and the condition i? < 1 is ensured by 

•■<¥' 

which means we lose one bit. 

Case 4. This is as in case 2 since A(-) and o(-) play a symmetric role in the 
analysis. 

At this stage, we have obtained the following. 

Proposition 11. In Cases 1-4; Algorithm^is correct for r obeying the bounds 
of Tabled 

The remaining cases are proved similarly: 

Case 5. We have |ei| < 2~ p , |e 2 | < 2"^ and |ei + e 2 + eie 2 | < 2 1 " /3 + 2~ 2/3 . 
Then ([6]) becomes 

R< 1 + i(l - 2-^ ~2- 213 ). 
P 

The condition R < 1 is implied by r < 2 /3 ~ 1 /(l + 2~ 1 ~ /3 ) which is close to \2$ '. 
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Case 7. The bound on r follows from case 3 as ei and €2 play a symmetric 
role in the error analysis of case 3. 

4.6 Using Algorithm [4] 



Algorithm 5 Applied FDIV 

Input: r, integer such that < r < 2@ — 1; 
Input: p, integer such that 1 < p < 1® — 1; 
Output: L£J- 

Constants 

1: B A «- 2^/(3 + 2 1 -/ 3 ) 
2: B «- 20/(3 + 
3: B v 4- 2 - 1 

Precomputation 

4: invpA <— o(l/p) 
5: invpo «— A(l/p) 
6: invp^y <— A(l/p) 

Division 
7: a; <— o(r • invp Q ) 
8: y <- L^J 

Possible correction 
9: if r > B then 
10: z <— p • y 
11: if z > r then 
12: 2/ <- y - 1 
13: end if 
14: end if 
15: Return y. 



Algorithm [5] demonstrates how to apply the results of Table [T] in a program. 
We precompute l/p in several rounding modes and use the best version in the 
subsequent multiplication, depending on the current rounding mode o(-). The 
strategy used here is to make sure [x\ > k after the multiplication so only one 
test at most is needed for the correction. In addition we choose the version that 
maximizes the bound B. 

It should be noted that there is no strategy in the choice of o 1 (-) that guar- 
antees L^J < k for any choice of o 2 (-) (take o 2 (-) = A(-)), meaning that the 
described strategy is indeed the only one than minimizes the number of tests 
needed for the correction. 

In a typical application of Algorithm [5] where r is the accumulation of several 
products modulo p, the bound B can be interpreted in the numbers of operations 
that can be performed before a reduction is necessary. If this number is not 
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exceeded then the correction is never needed. 



5 Application 1: Polynomial Multiplication 

5.1 Delayed Reduction 

A classical technique for modular polynomial multiplication is to use only de- 
layed reductions. The idea is to compute e ach polynomial coefficient by a de- 



mp 

layed dotproduct e.g., as in (jDumasl . 120041 ): products of the form ^2 i aibk- 



are accumulated without reductions, and the overflow is dealt with in one final 
pass. Thus, with a centered representation modulo p for instance (integers from 
(1 — p)/2 to (p — l)/2), it is possible to accumulate at least rid products as long 
as 

n d (p-l) 2 <2^ +1 . (8) 

The final modular reduction can be performed in many different ways (e.g., 
classical division, floating point multiplication by the inverse, Montgomery re- 
duction, etc.), we just call the best one REDC here. At worst, it is equivalent 
to 1 machine division. 



5.2 Fast Q-adic Transform 

We represent modular polynomials of the form P = 5^=0 o,iX % by P — ^2 Pi (X d+1 ) ' 
where the Pi's are polynomials of degree d stored in a single integer in the g-adic 
way. 

Then a product PQ has the form 

where each multiplication PiQ t -i is computed by Algorithm [1] on a single ma- 
chine integer. The final reduction is performed by a tabulated REDQ and can 
also be delayed as long as conditions are guaranteed. 



5.3 Comparison 

We use the following complexity model: multiplications and additions in the 
field are counted as atomic operations while the machine divisions are counted 
separately. For instance, we approximate REDC by one machine division and an 
axpy. We recall that REDQfc denotes a simultaneous reduction of k residues. In 
this complexity model a REDQfc thus requires 1 division and k/2 multiplications 
and additions. A REDQ with k residues is a Kronecker substitution with a 
polynomial of degree d = k — 1 . We also call k-FQT the use of a polynomial of 
degree d = fc — 1 for the q-adic substitution. Thus a multiplication PiQ t ~i in a k- 
FQT requires the reduction of 2d+l coefficients, i.e., a REDQ 2 <j+i =REDQ 2 fc-i- 
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Let P be a polynomial of degree N with indeterminate X. If we use a 
(<i+l)-FQT, it will then become a polynomial of degree D q in the indeterminate 
Y = X d+1 ,with 

N + 1" 



D ? = 



d+ 1 



1. 



Table [2] gives the respective complexities of both strategies. The values of 
and n q are given by Equations (|3I8[) : 



with g = 2 2d ^ 





2 /3+l 




n d = 


Lo- 1) 2 _ 


, rig = 



(d + l)(p-l)s 





Mul & Add 


Reductions 


Delayed 
d-FQT 


(2iV+ l) 2 
(2D, + l) 2 


(2iV+ ] 
(2D, + 1) 


) 

"2 


2JV+ 
rid 
D, + l" 
n q 


L 


REDC 
REDQ 2 d+i 



Table 2: Modular polynomial multiplication complexities. 



Example 12. With p = 3, TV = 500, is much larger than 27V + 1 and thus 
the classical delayed polynomial multiplication algorithm requires 10 6 multipli- 
cations and additions and 10 3 remainderings. 

If we choose a double floating point representation and a 4-FQT (i.e 4 co- 
efficients in a word, or a degree 3 substitution), the fully tabulated FQT boils 
down to 8.6 • 10 4 multiplications and additions and 5.7 • 10 3 divisions. On the 
one hand, the number of operations is therefore reduced by a factor close to 11. 
On the other hand the delayed code can compute every coefficient with a single 
reduction in this case, while the FQT code has to compute less coefficients, but 
breaks the pipeline. 

Example 13. Even by switching to a larger mantissa, say e.g., 128 bits, so that 
the DQT multiplications are roughly 4 times as costly as double floating point 
operations, the FQT can still be useful. 

Taking p — 1009 and choose d — 2, this still gives around 1.3 • 10 5 multiplica- 
tions and additions over 128 bits and 7- 10 3 divisions. The number of operations 
is still reduced by a factor of 7. This should therefore still be faster than the 
delayed multiplication over 32 bits. 



On Figure[2J we compare our two implementations with that of NTL ([Should . 



20071) . We see that the FQT is faster than NTL as long as the same algorithm 
is used. This shows that our strategy is very useful for small degrees and small 
primes; not only for the classical algorithm (left) but also for subquadratic ones 
(right): the use of FQT leads to a gain of an order of magnitude. Note how- 
ever that in the special case p — 2, NTL offers a very optimized implementation 
which is still an order of magnitude faster than our general purpose implementa- 



tion: specific binary routines, such as the ones proposed by (jWeimerskirch et al 



2003), enables to pack coefficients as bits of machine words. 
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polymul/s Polynomial mutiplication modulo 3 on a Xeon 3.6GHz 

100000 




polymul/s Polynomial mutiplication modulo 3 on a Xeon 3.6GHz 

100000 r 



6384 32768 




92 16384 32768 



Figure 2: Number of classical and Karatsuba polynomial multiplications modulo 
3 per second on a Xeon 3.6 GHz (logarithmic scale) 



6 Application 2: Small Finite Field Extensions 

The isomorphism between finite fields of equal sizes gives a canonical represen- 
tation: any finite field extension is viewed as the set of polynomials modulo a 
prime p and modulo an irreducible polynomial V of degree k. Clearly we can 
thus convert any finite field element to its g-adic expansion; perform the FQT 
between two elements and then reduce the polynomial thus obtained modulo V . 
Furthermore, it is possible to use floating point routines to perform exact linear 



algebra as demonstrated by Dumas et al. (2009) 



We use the strategy of ( Dumas et all 120021 Algorithm 4.1): convert vectors 
over ¥ p k to g-adic floating point; call a fast numerical linear algebra routine 
(BLAS); convert the floating point result back to the usual field representation. 
We improve all the conversion steps as follows: 

1. replace the Horner evaluation of the polynomials, to form the g-adic ex- 
pansion, by a single table look-up, recovering directly the floating point 
representation; 

2. replace the radix conversion and the costly modular reductions of each 
polynomial coefficient, by a single REDQ operation; 

3. replace the polynomial division by two table look-ups and a single field 
operation. 

This is presented in Algorithm [SI Line 1 is the table look-up of floating point 
values associated to elements of the field; line 2 is the numerical computation; 
line 3 is the first part of the REDQ reduction; lines 4 and 5 are a time-memory 
trade-off with two table accesses for the corrections of REDQ, combined with 
a conversion from polynomials to discrete logarithm representation; the last 
line 6 combines the latter two results, inside the field. A variant of REDQ is 
used in Algorithm [6j but ui still satisfies Ui = X^=i /4j'<7 j_ * mod p as shown 
in Theorem [5] Therefore the representations of /-^A- 7 in the field can be 
precomputed and stored in two tables where the indexing will be made by 
(u , . . . ,Uk-i) and (uk-i, ■ ■ ■ ,U2k-2) and not by the fit's. 
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Algorithm 6 Fast Dot product over Galois fields via FQT and FQT inverse 
Input: A field ¥ p k with elements represented as exponents of a generator of the 
field; 

Input: two vectors V\ and V2 of elements of ¥ p k; 
Input: a sufficiently large integer q. 
Output: R E Fpfc, with R = vf ■ v%. 

Tabulated g-adic conversion 

{Use conversion tables from exponent to floating point evaluation} 
1: Set v~i and V2 to the floating point vectors of the evaluations at q of the 
elements of v± and V2 ■ 

The floating point computation 
2: Compute f — vi ■ V2; 

Computing a radix decomposition 
3: r = REDQ_COMPRESSION{f,p,q); 

Variant of REDQ.CORRECTION 

{fa is such that fa — fa mod p for f — Y^=o 2 fiil 1 } 
4: Set L = representation(Y^i-n faX 1 ). 
5: Set H = representation(X k - 1 x £i=fe-i faX 2 ~ k+1 ). 

Reduction in the field 
6: Return R= H + L g V p k ; 



Thus, this algorithm approaches the performance of the prime field wrapping 
also for small extension fields. Indeed, suppose the internal representation of the 
extension field is already by discrete logarith ms and u s es con version tables from 
polynomial to index representations (see e.g., Dumasl ( 2004 ) for details). Then 



we choose a time-memory trade-off for the REDQ operation of the same order of 
magnitude, that is to say p k . The overall memory required by these new tables 
only doubles and the REDQ requires only 2 accesses. Moreover, in the small 
extension, the polynomial multiplication must also be reduced by an irreducible 
polynomial, V . This reduction can be precomputed in the REDQ table look-up 
and is therefore almost free. Moreover, many things can be factorized if the 
field representation is by discrete logarithms. For instance, the elements are 
represented by their discrete logarithm with respect to a generator of the field, 
instead of polynomials. In this case there ar e alrea dy some table accesses for 
many arithmetic operations, see e.g., ( Dumasl . 2004 . §2.4) for details. 



Theorem 14. Algorithm^ is correct. 

PROOF. We have to prove that it is possible to compute L and H from the 
Ui's. We have /i2fe-2 = ^2fe-2 and fa — fa — qui + \ mod p, for i = 0, . . . , 2k — 3. 
Therefore a precomputed table of p k entries, indexed by (uq, . . . , Uk—i), can 
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provide the representation of 

fe-2 

L = — qui + i mod p)X l . 

i=0 

Another table with p k entries, indexed by (uk-i, ■ ■ ■ ,U2k-2), can provide the 
representation of 

2fe-3 

H = u 2k -2X 2k ~ 2 + ^2 (u % - qu i+1 mod p)X\ 

i=k—l 

Finally R = X k ~ 1 x Yntk-i M^^ fe+1 + Ei=o Mi^ needs to be reduced 
modulo the irreducible polynomial used to build the field. But, if we are given 
the representations of H and L in the field, R is then equal to their sum inside 
the field, directly using the internal representations. 

Table [3] recalls the respective complexities of the conversion phase in both 
algorithms. Here, q is a power of two and the REDQ division is computed via 
the floating point routines of Section 5J 
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> 5k 
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Table 3: Complexity of the back and forth conversion between extension field 
and floating point numbers 

Figure[3]shows the speed of the conversion after the floating point operations. 
The log scales prove that for q ranging from 2 1 to 2 26 our new implementation 
is two to three times as fast as the previous onql. 

Furthermore, these improvements allow the extension field routines to reach 
the speed of 7800 millions of Fg operations per second^ as shown on Figure I4R 
The speed-up obtained with these new implementations in also shown on this 
Figure. It represents a reduction from the 15% overhead of the previous imple- 
mentation to less than 4% now, when compared over Fn. 

3 On a 32 bit Xeon. 

4 On a XEON, 3.6 GHz , using Goto BLAS-1.09 dgemm as the numerical routine 
iGoto and van de Geiinl |2002t ) and FFLAS fgemm for the fast prime field matrix multipli- 
cation jDumaT^t**aliT2009l V 

5 T he FFLAS routines are available within the LinBox 1.1.4 library l lLinBox Groud , 
l2007tl and the FQT is in implemented in the givgfqext.h file of the Givaro 3.2.9 library 
iDumas et~an,l2007l) . 
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Figure 3: Small extension field conversion speed on a Xeon 3.6GHz 



7 Application 3: Compressed Modular Matrix 
Multiplication 



We now extend the results of iDumas et al. I (|2008f) with the REDQ algorithm. 



The idea is to use Kronecker substitution to pack several matrix entries into a 
single machine word. We explore the possibilities of packing on the left or on the 
right only, together with packing on both matrices of a matrix multiplication. 



7.1 Middle Product Algorithm 

In this section, we show how a dot product of vectors of size d + 1 can be re- 
covered from a polynomial multiplication performed by a single machine word 
multiplication. This extends to matrix multiplication by compressing both ma- 
trices first. We first illustrate the idea for 2 x 2 matrices and d = 1. The 
product 

ae + bg af + bh 
ce + dg cf + dh 

is recovered from 

* + {ae + bg)Q + *Q 2 * + {af + bh)Q + * Q 2 

* + {ce + dg)Q + * Q 2 * + {cf + dh)Q + *Q 2 

where the character * denotes other coefficients. 

In general, A is an m x k matrix to be multiplied by a k x n matrix B, 
the matrix A is first compressed into a m x \k/{d + 1)] CompressedRowMatrix, 
CA, and B is transformed into a \k{d + 1)] X n CompressedColumnMatrix, CB. 
The compressed matrices are then multiplied and the result can be extracted 
from there. This is depicted on Fig. [5] 
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Figure 4: Speed of finite field Winograd matrix multiplication on a XEON, 3.6 
GHz 
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Figure 5: Compressed Matrix Multiplication (CMM) 



In terms of number of arithmetic operations, the matrix multiplication CA x 
CB can save a factor of d + 1 over the multiplication of A x B as shown on the 
2x2 case above. 

The computation has three stages: compression, multiplication and extrac- 
tion of the result. The compression and extraction are less demanding in terms 
of asymptotic complexity, but can still be noticeable for moderate sizes. For 
this reason, compressed matrices are often reused and it is more informative to 
distinguish the three phases in an analysis. This is done in Section 171)1 ( Table |5|) . 
where the actual matrix multiplication algorithm is also taken into account. 

Partial compression. Note that the last column of CA and the last row of 
B might not have d + 1 elements if d + 1 does not divide k. Thus one has to 
artificially append some zeroes to the converted values. On CB this means just 
do nothing. On CA whose compression is reversed, this means multiplying by 
Q several times. 
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7.2 Available Mantissa and Upper Bound on Q for the 
Middle Product 

If the product CA x CB is performed with floating point arithmetic we just 
need that the coefficient of degree d fits in the j3 bits of the mantissa. Writing 
CA x CB = cnQ d + Cl, we see that this implies that ch, and only ch, must 
have entries that remain smaller than 2 /3 . It can then be recovered exactly by 
multiplication of CA x CB with the co rrectly precomputed and rounded inverse 
of Q d as shown e.g., in 1 Dumasl . 20081 Lemma 2). 



With delayed reduction this means that 

E^ T (^+i)b-i)V- i; <2^. 

On the other hand, delay reduction requires (cf. Eq |3])) 

k{p - I) 2 < Q. (9) 

Thus the recovery is possible if 

Q d+1 < 2 . (10) 
and a single reduction has to be made at the end of the dot product as follows: 

Elementfe init( Elementfe rem, const double dp) const {. 
double r = dp; 

// Multiply by the inverse of Q~d with correct rounding 
r *= _inverseQto_d; 

// Now we just need the part less than Q=2~t 
unsigned long rl( static_cast<unsigned long>(r) ); 
rl &= _QMINUS0NE; 

// And we finally perform a single modular reduction 
rl 7,= _modulus ; 

return rem = static_cast<Element> (rl) ; 

> 

Note that one can avoid the multiplication by the inverse of Q when Q is a 
power of 2, say 2*: by adding Q 2d+1 to the final result one is guaranteed that 
the t(d + 1) high bits represent exactly the d + 1 high coefficients. On the one 
hand, the floating point multiplication is then replaced by an addition. On the 
other hand, this doubles the size of the dot product and thus reduces by a factor 
of d+ \/2 the largest possible dot product size k. 

7.3 Middle Product Performance 

On Figure [6] we compare our compression algorithm to the numerical double 



floatin g point matrix multiplication dgemm of GotoBlas bv lGoto and van de Geiin 



(2002) and to the f gemm modular matrix multiplication of the FFLAS-LinBox 
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Figure 6: Compressed matrix multiplication compared with dgemm (the floating 
point double precision matrix multiplication of GotoBlas) and f gemm (the exact 
routine of FFLAS) with double or single precision. 



library bv lDumas et~ai1 ((2002). For the latter we show timings using dgemm and 
also sgemm over single floating points. This figure shows that the compression 
(d+ 1) is very effective for small primes: the gain over the double floating point 
routine is quite close to d. 

Observe that the curve of f gemm with underlying arithmetic on single floats 
oscillates and drops sometimes. Indeed, the matrix begins to be too large and 
modular reductions are now required between the recursive matrix multipli- 
cation steps. Then the floating point BLAS routines are used only when the 
sub-matrices are small enough. One can see the subsequent increase in the 
number of classical arithmetic steps on the drops around 2048, 4096 and 8192. 
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Table 4: Compression factors for different common matrix dimensions modulo 3, 
with 53 bits of mantissa and Q a power of 2. 

On Table [H we show the compression factors modulo 3, with Q a power of 2 
to speed up conversions. For a dimension n < 256 the compression is at a factor 
of five and the time to perform a matrix multiplication is slightly more than a 
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millisecond. Then from dimensions from 257 to 2048 one has a factor of 4 and 
the times are roughly 16 times the time of the four times smaller matrix. The 
next stage, from 2048 to 32768 is the one that shows on Figure [51 

Figure [5] shows the dramatic impact of the compression dropping from 4 
to 3 between n = 2048 and n = 2049. It would be interesting to compare the 
multiplication of 3-compressed matrices of size 2049 with a decomposition of the 
same matrix into matrices of sizes 1024 and 1025, thus enabling 4-compression 
also for matrices larger than 2048, but with more modular reductions. 



7.4 Right or Left Compressed Matrix Multiplication 

Another way of performing compressed matrix multiplication is to multiply an 
uncompressed m x k matrix to the right by a row-compressed k x n/(d+l) 
matrix. We illustrate the idea on 2 x 2 matrices: 

(ae + bg) + Q(af + bh)~ 
{ce + dg) + Q(cf + dh)_ 

The general case is depicted on Fig. center. This is called Right Compressed 
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Figure 7: Left, Right and Full Compressions 



Matrix Multiplication. Left Compressed Matrix Multiplication is obtained by 
transposition. Here also Q and d must satisfy Eqs. © and (fTU)) . 

The major difference with the Compressed Matrix Multiplication lies in the 
reductions. Indeed, now one needs to reduce simultaneously the d+l coefficients 
of the polynomial in Q in order to get the results. This simultaneous reduction 
can be made by the REDQ algorithm. 

When working over compressed matrices C A and CB, a first step is to un- 
compress CA, which has to be taken into account when comparing methods. 
Thus the whole right compressed matrix multiplication is the following algo- 
rithm 

A = Uncompress(CA); CC = Ax CB; REDQ(CC) (11) 
We see on Figure [S] that the used of REDQ instead of the middle product 
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algorithm has a high benefit. Indeed for small matrices, the conversion can 
represent 30% of the time and any improvement there has a high impact. 



7.5 Full Compression 

It is also possible to compress simultaneously both dimensions of the matrix 
product (see Fig. [7j right). This is achieved by using polynomial multiplication 
with two variables Q and 0. Again, we start by an example in dimension 2: 



[a + Qc b + Qd] x 



e + 0/ 
g + Oh 



= [(oe + bg) + Q(ce + dg) + 6(of + bh) + QQ(cf + dh)} 



More generally, let d q be the degree in Q and dg be the degree in 0. Then, the 
dot product is: 

d q d q do do 

a ■ b = E ° io( 5 1 ' ■ ■ ■ ' E ain ^ x E 5 °J ej ' ■ ■ ■ ' E b ^ Qj i 

i=0 i=0 j=0 j=0 

k d q do d q do k 

= EE a ^E w© 3 = E EE "<^^'(->". 

1=0 1=0 j'=0 i=0 j=0 1=0 

In order to guarantee that all the coefficients can be recovered independently, 
Q must still satisfy Eq. §§§ but then must satisfy an additional constraint: 



Q d « +1 < 0. 
This imposes restrictions on d q and dg: 

Q(dq + l)(do + l) < 



(12) 



(13) 



7.6 CMM Comparisons 

In Table [51 we summarize the differences between the algorithms presented on 
Figures [5] and [7] As usual, the exponent uj denotes the exponent in the complex- 
ity of matrix multiplication. Thus, u> = 3 for the classical m atrix multiplication , 
while uj < 3 for faster matrix multiplications, as used in (|Dumas et all 120091 
§3.2). For products of rectangular matrices, we use the classical technique of 
first decomposing the matrices into square blocks and then using fast matrix 
multiplication on those blocks. 



7.6.1 Compression Factor 

The costs in Table [5] are expressed in terms of a compression factor e, that we 
define as 



e := 



log 2 (Q) 
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where, as above, /3 is the size of the mantissa and Q is the integer chosen 
according to Eqs. ([9]) and (jlOp . except for Full Compression where the more 
constrained Eq. (TT3"j) is used. 

Thus the degree of compression for the first three algorithms is just d — e— 1, 
while it becomes only d = \fe — 1 for the full compression algorithm (with equal 
degrees d q = dg = d for both variables Q and 9). 



Algorithm 


Operations 


Reductions 


Conversions 


CMM 




to x n REDC 


imn INITe 


Right Comp. 




to x f REDQ e 


±mn EXTRACT e 

e c 


Left Comp. 




f x n REDQ e 


±mn EXTRACT e 

e e 


Full Comp. 




% x $ REDQ e 


imn INIT e 

e e 



Table 5: Number of arithmetic operations for the different algorithms 



7.6.2 Analysis 

In terms of asymptotic complexity, the cost in number of arithmetic operations 
is dominated by that of the product (column Operations in the table), while 
reductions and conversions are linear in the dimensions. This is well reflected in 
practice. For example, with algorithm Right Compression on matrices of sizes 
10, 000 x 10, 000 it took 90.73 seconds to perform the matrix multiplication mod- 
ulo 3 and 1.63 seconds to convert the resulting matrix. This is less than 2%. For 
250 x 250 matrices it takes less than 0.00216 seconds to perform the multipli- 
cation and roughly 0.0016 seconds for the conversions. There, the conversions 
account for 43% of the time and it therefore of extremely high importance to 
optimize the conversions. 

In the case of rectangular matrices, the second column of Table [5] shows that 
one should choose the algorithm depending on the largest dimension: CMM if 
the common dimension k is the largest, Right Compression if n if the largest and 
Left Compression if m dominates. The gain in terms of arithmetic operations 
is e"~ 2 for the first three variants and e~ for full compression. This is not 
only of theoretical interest but also of practical value, since the compressed 
matrices are then less rectangular. This enables more locality for the matrix 
computations and usually results in better performance. Thus, even if u = 3, 
i.e., classical multiplication is used, these considerations point to a source of 
speed improvement. 

The full compression algorithm seems to be the best candidate for locality 
and use of fast matrix multiplication; however the compression factor is an 

integer, depending on the flooring of either Xo J^(Q) or \J \oJ^(Q) ■ Thus there are 
matrix dimensions for which the compression factor of e.g., the right compression 
will be larger than the square of the compression factor of the full compression. 
There the right compression will have some advantage over the full compression. 



2N 



If the matrices are square (m = n = k) or if ui = 3, the products all 
become the same, with similar constants implied in the 0(), so that apart from 
locality considerations, the difference between them lies in the time spent in 
reductions and conversion s. Since the REDQ e reduction is faster than e classical 
reductions ijDumasl . 120081 ). and since INIT e and EXTRACT^ are roughly the 
same operations, the best algorithm would then be one of the Left, Right or Full 
compression. Further work would include implementing the Full compression 
and comparing the actual timings of conversion overhead with that of the Right 
algorithm and that of CMM. 



8 Conclusion 

We have proposed a new algorithm for simultaneous reduction of several residues 
stored in a single machine word. For this algorithm we also give a time-memory 
trade-off implementation enabling very fast running time if enough memory is 
available. 

We have shown very effective applications of this trick for packing residues in 
large applications. This proves efficient for modular polynomial multiplication, 
extension fields conversion to floating point and linear algebra routines over 
small prime fields. 

Further work is needed to compare of running times between different choices 
for q. Indeed our experiments were made with q a power of two and large table 
look-up. With q a multiple of p the table look-up is not needed but divisions by 
q % will be more expensive. A possibility would be taking q in the form q = p2 t , 
then only divisions by p or p 1 would be made. 

It would also be interesting to see in practice how this trick extends to larger 
precision implementations: on the one hand the basic arithmetic slows down, 
but on the other hand the trick enables a more compact packing of elements 
(e.g., if an odd number of field elements can be stored inside two machine words, 
etc.). 
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Finite field Winograd matrix multiplication with Goto BLAS on a XEON, 3.6 GHz 
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Figure 8: Right Compression and CMM. 
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